Extraction of Citation Data from Websites based on Visual Cues

Author

  • Tim Repke
Abstract

Every scientific article, be it an essay or a paper in a journal, provides citations and a bibliography to support its arguments. Publishers and universities require their academics to format their bibliography using predefined styles such as APA, MLA or Harvard, to name just a few. Doing this by hand is tedious, which is why a multitude of software exists to format references according to style guides, provided the required information is available in a structured format. Some services (citation generators), such as CiteThisForMe, EasyBib or RefME, even try to provide this information automatically, so the user only needs to enter an identifier like an ISBN, DOI or URL to add a specific work to the bibliography. This thesis focuses on the case where a web resource needs to be referenced. One would have to open the website and find the relevant information needed to reference this specific resource, such as the title, author, date of publication and publisher. In the scope of this work, this information is called citation data. The aforementioned citation generators automate this process, so that one only needs to enter a URL; the software then tries to extract the data required to cite the resource. For example, to reference a news article on BBC online, the citation generator would 1) open the URL, 2) extract the title, author, date of publication and publisher, and 3) return a properly formatted reference according to the selected style. In my work for RefME, a start-up creating a citation management platform, I created and enhanced the underlying scraper service, which, given a URL, fetches the HTML code of that website and extracts citation data based on hard-coded identifiers. This method is very sensitive to the way a website's code is written, which inspired a different approach. Based on the experience gathered, it seems that using visual cues might improve the precision of the citation data extraction. The primary advantage is that those are independent
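
To make the identifier-based baseline described in the abstract concrete, the following is a minimal sketch in Python. It is not the RefME scraper and not the method proposed in the thesis. It only illustrates extraction from HTML via hard-coded identifiers, followed by formatting. The `<meta>` names in `FIELD_IDENTIFIERS`, the helper names and the simplified APA-like output are assumptions chosen for illustration. Fetching the page (step 1) is left out, so the sketch works on an inline sample document.

```python
from html.parser import HTMLParser

# Hypothetical mapping from citation field to hard-coded <meta> identifiers.
# These names are illustrative assumptions, not taken from the thesis.
FIELD_IDENTIFIERS = {
    "title": ["og:title", "citation_title"],
    "author": ["author", "citation_author"],
    "date": ["article:published_time", "citation_publication_date"],
    "publisher": ["og:site_name", "citation_publisher"],
}


class MetaCollector(HTMLParser):
    """Collects the content of each <meta> tag, keyed by its name or property."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        # Keep the first occurrence of each identifier.
        if key and content and key not in self.meta:
            self.meta[key] = content.strip()


def extract_citation_data(html):
    """Step 2: look up each field under its hard-coded identifiers."""
    collector = MetaCollector()
    collector.feed(html)
    data = {}
    for field, identifiers in FIELD_IDENTIFIERS.items():
        data[field] = next(
            (collector.meta[i] for i in identifiers if i in collector.meta), None
        )
    return data


def format_reference(data, url):
    """Step 3: simplified APA-like formatting (not a full style implementation)."""
    author = data["author"] or "Unknown author"
    year = data["date"][:4] if data["date"] else "n.d."
    title = data["title"] or "Untitled"
    publisher = data["publisher"] or "Unknown publisher"
    return f"{author} ({year}). {title}. {publisher}. {url}"


SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example headline">
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2016-03-01T09:00:00Z">
<meta property="og:site_name" content="Example News">
</head><body><h1>Example headline</h1></body></html>
"""

citation = extract_citation_data(SAMPLE_HTML)
print(format_reference(citation, "https://example.com/article"))
```

A page that omits these tags or names them differently yields `None` for the affected fields, which illustrates the sensitivity to a website's code noted in the abstract.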

Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we are inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Investigating common visual symbols in photography of the Islamic Revolution based on Pearce's pattern

The Islamic Revolution, which has remained very influential on the way people lived, changed the path of art, and especially photography, in Iran. There are many photographs of the Revolution. By examining the visual cues, the changes in the cultural and artistic trends of that period can be better understood. Are there any shared signs between the images in the photograph and do they share certai...

Reduced-Reference Image Quality Assessment based on saliency region extraction

In this paper, a novel saliency-theory-based RR-IQA metric is introduced. As the human visual system is sensitive to the salient region, evaluating the image quality based on the salient region could increase the accuracy of the algorithm. In order to extract the salient regions, we use a blob decomposition (BD) tool as a texture component descriptor. A new method for blob decomposition is propos...

Revisiting Web Data Extraction Using In-Browser Structural Analysis and Visual Cues in Modern Web Designs

Recent trends in website design have an impact on methods used for web data extraction. Many existing methods rely on structural analysis of web pages and, with the introduction of CSS, table-based layouts are no longer used, while responsive design means that layout and presentation are dependent on browsing context which also makes the use of visual clues more complex. We present DeepDesign, ...

Feature Extraction of Visual Evoked Potentials Using Wavelet Transform and Singular Value Decomposition

Introduction: Brain visual evoked potential (VEP) signals are commonly known to be accompanied by high levels of background noise, typically from the spontaneous background brain activity of electroencephalography (EEG) signals. Material and Methods: A model based on a dyadic filter bank, discrete wavelet transform (DWT), and singular value decomposition (SVD) was developed to analyze the raw data...

Publication date: 2016